1. INTRODUCTION



This note describes the results of a computational comparison of value
iteration algorithms suggested for solving finite state discounted Markov
decision processes. Such a process visits a set of states $S = \{1, 2, \ldots, M\}$.
When it is in state $i$, one can choose an action $k$ from the finite
action set $K_i$, and then receive an immediate reward $r_i^k$, and with
probability $p_{ij}^k$ the process will be in state $j$ at the next period. The
object is to find $v(i)$, the maximum discounted reward over an infinite
horizon starting in state $i$, where $\beta$ is the discount factor. It
is well known [1] that $v(i)$ satisfies the optimality equation

$$v(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{M} p_{ij}^k \, v(j) \Big\}, \qquad i = 1, \ldots, M. \tag{1.1}$$

We record the time for value iteration algorithms to obtain $\varepsilon$-optimal
solutions, $v_n$, to (1.1) (i.e. $\|v - v_n\|_\infty < \varepsilon$, where
$\|v\|_\infty = \max_i |v(i)|$) on randomly generated problems. We look at
three classes of fifteen problems each with $\beta = .9$ and $\varepsilon = .0001$,
where $v(i) \approx 2{,}000$. Class 1 problems have 100 states and between 2 and
7 actions per state; class 2 have 40 states and between 2 and 70 actions per
state, whereas class 3 have 10 states and up to 500 actions per state. Details
of how the problems are generated and computing facilities used are given in [12].

In Section two we describe the schemes examined and the various bounds
that can be used for stopping them. Section three concentrates on one
scheme that did well in the comparison (ordinary value iteration) and
looks at various methods for eliminating non-optimal actions both
permanently and temporarily.


2. SCHEMES AND BOUNDS 



The scheme usually described as value iteration is

$$v_{n+1}(i) = \max_{k \in K_i} v_{n+1}^k(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{M} p_{ij}^k \, v_n(j) \Big\} \tag{2.1}$$

which was discussed in [1,3]. In analogy with the notation of linear
equations we call this Pre-Jacobi (PJ). This analogy leads us to think of
the following alternative schemes.

Jacobi (J):

$$v_{n+1}(i) = \max_{k \in K_i} \Big\{ \Big( r_i^k + \beta \sum_{j \ne i} p_{ij}^k \, v_n(j) \Big) \Big/ \big( 1 - \beta p_{ii}^k \big) \Big\} \tag{2.2}$$

Pre-Gauss-Seidel (PGS):

$$v_{n+1}(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{i-1} p_{ij}^k \, v_{n+1}(j) + \beta \sum_{j=i}^{M} p_{ij}^k \, v_n(j) \Big\} \tag{2.3}$$

Gauss-Seidel (GS):

$$v_{n+1}(i) = (A_{GS} v_n)(i) = \max_{k \in K_i} \Big\{ \Big( r_i^k + \beta \sum_{j=1}^{i-1} p_{ij}^k \, v_{n+1}(j) + \beta \sum_{j=i+1}^{M} p_{ij}^k \, v_n(j) \Big) \Big/ \big( 1 - \beta p_{ii}^k \big) \Big\} \tag{2.4}$$

Successive Over-Relaxation (SOR):

$$v_{n+1}(i) = \omega \, (A_{GS} v_n)(i) + (1 - \omega) \, v_n(i) \tag{2.5}$$

(PGS) was suggested by Kushner [4], Porteus [8] and Reetz [10]; (J) and
(GS) are found in [9] and SOR in [5]. Experiments with SOR suggested a
value of $\omega = 1.28$ for robust and speedy convergence.
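
The five schemes differ only in which values a sweep reads and whether the diagonal term is solved for, which a short sketch can make concrete. The code below is illustrative only: the function name `sweep`, its flags, the nested-list layout `r[i][k]`, `p[i][k][j]`, and the two-state demo problem are assumptions made for this example, not part of the original study. For SOR it also assumes that the already relaxed values are used for states $j < i$, as in the usual linear-equation form of the method.

```python
def sweep(r, p, beta, v, in_place=False, solve_diagonal=False, omega=1.0):
    """One sweep over all states; returns the next iterate as a new list.

    r[i][k]    : immediate reward for action k in state i   (assumed layout)
    p[i][k][j] : probability of moving from i to j under k  (assumed layout)

    in_place=False, solve_diagonal=False -> Pre-Jacobi (2.1)
    in_place=False, solve_diagonal=True  -> Jacobi (2.2)
    in_place=True,  solve_diagonal=False -> Pre-Gauss-Seidel (2.3)
    in_place=True,  solve_diagonal=True  -> Gauss-Seidel (2.4)
    Gauss-Seidel flags with omega != 1   -> SOR (2.5)
    """
    M = len(v)
    new = list(v)
    # With in_place, states j < i already hold v_{n+1}(j) when state i is updated.
    src = new if in_place else v
    for i in range(M):
        best = float("-inf")
        for k in range(len(r[i])):
            if solve_diagonal:
                off = sum(p[i][k][j] * src[j] for j in range(M) if j != i)
                val = (r[i][k] + beta * off) / (1.0 - beta * p[i][k][i])
            else:
                val = r[i][k] + beta * sum(p[i][k][j] * src[j] for j in range(M))
            best = max(best, val)
        new[i] = omega * best + (1.0 - omega) * v[i]
    return new


# Hypothetical two-state, two-action problem used only for illustration.
r_demo = [[1.0, 0.5], [0.0, 2.0]]
p_demo = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]

v_gs = [0.0, 0.0]
for _ in range(500):
    v_gs = sweep(r_demo, p_demo, 0.9, v_gs, in_place=True, solve_diagonal=True)
```

All four non-relaxed schemes share the fixed point of (1.1), so on this small problem they should agree after enough sweeps; only the work per sweep and the number of sweeps differ.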
We require bounds on the iterates of the scheme to ensure we stop when
$v_n$ is within a specified value of the optimal $v$ of (1.1). One can use
the norm bound, which says that if $\|Q^k\|_\infty \le \alpha < 1$ for all
possible transition matrices $Q^k$ in the scheme

$$v_{n+1} = \max_{k} \{ s^k + Q^k v_n \} \tag{2.6}$$

then $\|v - v_{n+1}\|_\infty \le \frac{\alpha}{1 - \alpha} \|v_{n+1} - v_n\|_\infty$.
For (PJ), (J), (PGS) and (GS), it is trivial to show the corresponding
$Q^k$'s have norm no greater than $\beta$. For S.O.R. we estimate $\alpha$ by
$\|v_{n+1} - v_n\|_\infty / \|v_n - v_{n-1}\|_\infty$ and substitute it in
the norm bound to get a heuristic bound.
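
As a minimal sketch of how the norm bound serves as a stopping rule, the following runs the PJ scheme (2.1) with $\alpha = \beta$ and stops once $\frac{\beta}{1-\beta}\|v_{n+1} - v_n\|_\infty < \varepsilon$. The function name `value_iteration`, the data layout `r[i][k]`, `p[i][k][j]`, the zero starting vector and the demo problem are assumptions for illustration; the tighter Porteus and second order bounds are not shown.

```python
def value_iteration(r, p, beta, eps, max_iter=100000):
    """Pre-Jacobi value iteration stopped by the norm bound with alpha = beta.

    Returns (v, n): an eps-optimal value vector and the number of sweeps used.
    """
    M = len(r)
    v = [0.0] * M  # arbitrary starting vector (assumption)
    for n in range(1, max_iter + 1):
        new = [
            max(
                r[i][k] + beta * sum(p[i][k][j] * v[j] for j in range(M))
                for k in range(len(r[i]))
            )
            for i in range(M)
        ]
        step = max(abs(x - y) for x, y in zip(new, v))  # ||v_{n+1} - v_n||
        v = new
        if beta / (1.0 - beta) * step < eps:
            return v, n
    raise RuntimeError("no convergence within max_iter sweeps")


# Hypothetical two-state, two-action problem used only for illustration.
r_demo = [[1.0, 0.5], [0.0, 2.0]]
p_demo = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]
v_eps, sweeps = value_iteration(r_demo, p_demo, 0.9, 1e-4)
```

With $\beta = .9$ the factor $\beta/(1-\beta)$ equals 9, so the successive iterates must agree to within $\varepsilon/9$ before the rule fires, which is why tighter bounds can save many sweeps.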

Porteus [7] described tighter bounds for these schemes, exploiting the
non-negativity of the elements $q_{ij}^k$ of $Q^k$ in (2.6). They require
calculation of $\alpha_i^k = \sum_{j=1}^{M} q_{ij}^k$ for the maximizing
action $k$ at each iterate, and we call these the P.C. bounds (Porteus with
calculation). In [12] we describe how to estimate the $\alpha_i^k$ initially,
which avoids the calculation at each step but gives looser bounds, which we
denote P.N.C. (Porteus no calculation). For the (PJ) scheme we also use the
second order bounds (S.O.) described in [11], which use the last three
iteration values to get a tighter lower bound than Porteus's bound.

The results are given in the following table, where AV. is the average
C.P.U. time for solving the fifteen problems, S.D. the standard deviation
of the C.P.U. time, and N. the number of problems that method was quickest
at solving.

TABLE 1

METHOD  BOUNDS    CLASS 1 (100 STATE)    CLASS 2 (40 STATE)     CLASS 3 (10 STATE)
                   AV.    S.D.    N.      AV.    S.D.    N.      AV.    S.D.    N.
PJ      PC=PNC     1.66    .11    15       .79    .10    14       .54    .07     4
        SO         1.69    .11     0       .80    .11     1       .54    .08    11
J       L∞        11.88    .59     0     14.38   2.27     0      7.66   1.14     0
        PNC       10.49    .63     0     11.55   2.05     0      6.90   1.09     0
PGS     L∞         6.86    .33     0      8.62   1.34     0      5.18    .82     0
        PC         6.59    .33     0      8.34   1.32     0      5.03    .81     0
        PNC        6.60    .33     0      8.40   1.33     0      5.01    .81     0
GS      L∞         6.55    .32     0      8.13   1.41     0      3.99    .61     0
        PC         6.32    .34     0      7.77   1.38     0      3.90    .61     0
        PNC        6.25    .32     0      7.70   1.41     0      3.83    .61     0
SOR     L∞         3.30    .22     0      4.00    .55     0      1.86    .30     0



For Jacobi, the P.C. bound is the same as the P.N.C. bound and so the
latter must be faster as it involves less calculation. It is obvious from
Table 1 that PJ with Porteus bounds performs very well, and in the next
section we concentrate on this scheme and apply elimination of non-optimal
actions.



3. ACTION ELIMINATION 

MacQueen [6] described how, for any bounds, one can construct a test to
identify actions that cannot optimize the right hand side of (1.1) and so
can be permanently eliminated from the calculation. Applying MacQueen's
bounds [6] and Porteus's bound [7] for the PJ algorithm leads to the
following tests to eliminate action $k$ in $K_i$ permanently.

MacQueen:

$$v_n^k(i) < v_n(i) + \beta (a_n - b_n)/(1 - \beta) \tag{3.1}$$

Porteus:

$$v_n^k(i) < v_n(i) + \beta^2 (a_{n-1} - b_{n-1})/(1 - \beta) \tag{3.2}$$

where $a_n = \min_i \big( v_n(i) - v_{n-1}(i) \big)$ and
$b_n = \max_i \big( v_n(i) - v_{n-1}(i) \big)$.
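
To illustrate test (3.1), here is a small sketch of PJ value iteration that stores every $v_n^k(i)$ for the surviving actions, then computes $v_n(i)$, $a_n$, $b_n$ and drops each action that satisfies (3.1). The function name `pj_with_macqueen`, the data layout, the use of the norm bound for stopping and the demo problem are assumptions made for this example rather than the implementation timed in this note.

```python
def pj_with_macqueen(r, p, beta, eps, max_iter=100000):
    """PJ value iteration with permanent elimination by MacQueen's test (3.1).

    Returns (v, active, n) where active[i] lists the actions never eliminated.
    """
    M = len(r)
    active = [list(range(len(r[i]))) for i in range(M)]
    v = [0.0] * M  # arbitrary starting vector (assumption)
    for n in range(1, max_iter + 1):
        # Calculate and store v_n^k(i) for every surviving action.
        vk = [
            {k: r[i][k] + beta * sum(p[i][k][j] * v[j] for j in range(M))
             for k in active[i]}
            for i in range(M)
        ]
        new = [max(vk[i].values()) for i in range(M)]
        diffs = [new[i] - v[i] for i in range(M)]
        a, b = min(diffs), max(diffs)
        slack = beta * (a - b) / (1.0 - beta)  # never positive
        for i in range(M):
            # Keep k unless v_n^k(i) < v_n(i) + beta (a_n - b_n) / (1 - beta).
            active[i] = [k for k in active[i] if vk[i][k] >= new[i] + slack]
        v = new
        if beta / (1.0 - beta) * max(abs(a), abs(b)) < eps:
            return v, active, n
    raise RuntimeError("no convergence within max_iter sweeps")


# Hypothetical two-state, two-action problem used only for illustration.
r_demo = [[1.0, 0.5], [0.0, 2.0]]
p_demo = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]
v_mq, active_mq, n_mq = pj_with_macqueen(r_demo, p_demo, 0.9, 1e-6)
```

Because $a_n \le b_n$, the maximizing action always passes the test, so every state keeps at least one action; the savings come from the shrinking dictionaries `vk[i]` in later sweeps.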



We looked at four ways of implementing these tests.

M1. At the nth iteration, calculate and store $v_n(i)$ for each $i$. Then
calculate $a_n$ and $b_n$. Recalculate $v_n^k(i)$ and use (3.1) to test
for elimination.

M2. At the nth stage, calculate and store all $v_n^k(i)$. Hence calculate
$v_n(i)$, $a_n$, $b_n$ and test for elimination using (3.1) without
recalculating $v_n^k(i)$.

P1. At the (n+1)th stage, calculate $v_{n+1}^k(i)$, starting with the action
$k$ that maximized the previous stage. Apply (3.2) as soon as you calculate
each $v_{n+1}^k(i)$, using as $v_{n+1}(i)$ the one that gives the maximum
$v_{n+1}^k(i)$ so far calculated, see [7].

P2. At the (n+1)th stage, calculate and store $v_{n+1}^k(i)$. Then, using
$v_{n+1}(i)$, apply (3.2).

As Table 2 shows, M2 is far superior to M1, but P1 and P2 give similar
results. M2, P1 and P2 each cut the average time by about a half, though.

Hastings and Van Nunen [2] pointed out that one could also eliminate
actions temporarily, i.e. actions that will not be the optimizing actions
at the next iteration of the PJ algorithm. This is based on the inequality

$$v_{n+s}(i) - v_{n+s}^k(i) \ge v_n(i) - v_n^k(i) - \beta \sum_{j=1}^{s} \big( b_{n+j-1} - a_{n+j-1} \big) \tag{3.3}$$

If the R.H.S. of (3.3) is positive, $k$ will not optimize the (n+s)th
iteration, and in that case, at the (n+s+1)th iteration we need only
subtract another $\beta (b_{n+s} - a_{n+s})$ from this positive number to
test if $k$ could be optimal. If action $k$ is not eliminated at the (n+s)th
iteration, $v_{n+s}(i) - v_{n+s}^k(i)$ is stored for the test at the next
iteration.
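
The bookkeeping behind (3.3) can be sketched as follows for the case where only temporary elimination is used. For each state and action the code keeps a lower bound on $v_n(i) - v_n^k(i)$; at the next sweep it subtracts $\beta(b_n - a_n)$ and skips the action if the result is still positive. The function name `pj_temp_only`, the data layout, the counter `evaluated`, the norm-bound stopping rule and the demo problem are assumptions made for illustration, not the code timed in this note.

```python
def pj_temp_only(r, p, beta, eps, max_iter=100000):
    """PJ value iteration with temporary action elimination based on (3.3).

    Returns (v, n, evaluated), where evaluated counts the action values
    v_n^k(i) that were actually computed.
    """
    M = len(r)
    v = [0.0] * M  # arbitrary starting vector (assumption)
    # gap[i][k] is a lower bound on v_n(i) - v_n^k(i); zero forces evaluation.
    gap = [[0.0] * len(r[i]) for i in range(M)]
    spread = 0.0  # beta * (b_n - a_n); nothing is known before the first sweep
    evaluated = 0
    for n in range(1, max_iter + 1):
        new = [0.0] * M
        for i in range(M):
            vals = {}
            for k in range(len(r[i])):
                bound = gap[i][k] - spread
                if bound > 0.0:
                    # k cannot maximize this sweep: skip it, keep the reduced bound.
                    gap[i][k] = bound
                else:
                    vals[k] = r[i][k] + beta * sum(
                        p[i][k][j] * v[j] for j in range(M))
                    evaluated += 1
            new[i] = max(vals.values())
            for k, val in vals.items():
                gap[i][k] = new[i] - val  # exact gap for evaluated actions
        diffs = [new[i] - v[i] for i in range(M)]
        a, b = min(diffs), max(diffs)
        spread = beta * (b - a)
        v = new
        if beta / (1.0 - beta) * max(abs(a), abs(b)) < eps:
            return v, n, evaluated
    raise RuntimeError("no convergence within max_iter sweeps")


# Hypothetical two-state, two-action problem used only for illustration.
r_demo = [[1.0, 0.5], [0.0, 2.0]]
p_demo = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]
v_tmp, n_tmp, work = pj_temp_only(r_demo, p_demo, 0.9, 1e-6)
```

The action that attained the maximum at the previous sweep has a stored gap of zero, so it is always evaluated and `vals` is never empty. A skipped action returns to the calculation as soon as its accumulated bound is no longer positive, which is what distinguishes temporary from permanent elimination.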

We looked at four ways of implementing these two elimination procedures.
Recall that the (n+1)th iteration consists of the following sequence of
calculations.

(I)   $a_n$, $b_n$
(II)  $v_{n+1}^k(i)$
(III) $v_{n+1}(i)$
(IV)  $a_{n+1}$, $b_{n+1}$

TEMP HVN. Hastings and Van Nunen [2] suggested the temporary elimination
test be made at (I), and if $k$ was not temporarily eliminated then
$v_{n+1}^k(i)$ was calculated. The permanent elimination test was made at
(II) using (3.2) with $v_{n+1}(i)$ replaced by a lower bound
$v_n(i) + \beta a_n$. If the action is not permanently eliminated,
$v_n(i) + \beta a_n - v_{n+1}^k(i)$ (rather than
$v_{n+1}(i) - v_{n+1}^k(i)$) is stored.

TEMP + P1. Temporary elimination occurs at (I) and permanent elimination
at (II) using P1.

TEMP + P2. Again this has temporary elimination at (I) and permanent
elimination at (III) using P2.

TEMP + M2. Temporary elimination occurs at (I) and in this case
$v_{n+1}^k(i)$ was stored until (IV) and then the M2 technique used. If
action $k$ was not eliminated, $v_{n+1}(i) - v_{n+1}^k(i)$ was stored for
the temporary elimination test of the next iteration, which followed
immediately.

In this case, when permanent and temporary elimination are done at
the same stage, it is obvious that any action which is permanently eliminated
would also be temporarily eliminated.
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This leads us to ask is it worth permanently eliminating, so 
■^LY Perform the temporary elimination test at (I) • If action k is not 
eliminated v^^^(i) is stored through state (II) and changed to 

at stage (III) for the next temporary elimination test 

at (IV). 

Table 2 describes the results and shows that temporary elimination 
further cuts the time by 25%, and that pure temporary elimination might 
be particularly good on large scale problems. 



TABLE 2

METHOD       CLASS 1 (100 STATE)    CLASS 2 (40 STATE)     CLASS 3 (10 STATE)
              AV.    S.D.    N.      AV.    S.D.    N.      AV.           S.D.    N.
M1            1.51    .09     0      0.67    .08     0      0.42           .06     0
M2            0.81    .05    15      0.36    .04    15      0.25           .03    15
P1            0.87    .05     0      0.43    .05     0      0.26           .03     0
P2            0.88    .05     0      0.45    .05     0      [illegible]    .04     0
TEMP HVN      0.62    .03     0      0.27    .03     0      0.21           .03     0
TEMP + P1     0.60    .04     0      0.25    .02     3      0.20           .03    11
TEMP + P2     0.58    .03     0      0.26    .02     0      0.23           .03     0
TEMP + M2     0.59    .04     0      0.22    .04     2      0.21           .03     4
TEMP ONLY     0.55    .03    15      0.22    .04    10      0.21           .03     0



Our object has not been to obtain a best buy, but to give some idea of 
the merits of the various schemes, bounds and improvements. Obviously for 
more structured problems, algorithms which exploit the structure will be 
at an advantage. 